The last century saw the application of Boolean algebra toward the construction 
of computing machines, which work by applying logical transformations to infor- 
mation contained in their memory. The development of information theory and the 
generalization of Boolean algebra to Bayesian inference have enabled these comput- 
ing machines, in the last quarter of the twentieth century, to be endowed with the 
ability to learn by making inferences from data. This revolution is just beginning as 
new computational techniques continue to make difficult problems more accessible. 
However, modern intelligent machines work by inferring knowledge using only their 
pre-programmed prior knowledge and the data provided. They lack the ability to 
ask questions, or request data that would aid their inferences. 

Recent advances in understanding the foundations of probability theory have re- 
vealed implications for areas other than logic. Of relevance to intelligent machines, 
we identified the algebra of questions as the free distributive algebra, which now 
allows us to work with questions in a way analogous to that which Boolean algebra 
enables us to work with logical statements. In this paper we describe this logic 
of inference and inquiry using the mathematics of partially ordered sets and the 
scaffolding of lattice theory, discuss the far-reaching implications of the methodol- 
ogy, and demonstrate its application with current examples in machine learning. 
Automation of both inference and inquiry promises to allow robots to perform sci- 
ence in the far reaches of our solar system and in other star systems by enabling 
them to not only make inferences from data, but also decide which question to ask, 
experiment to perform, or measurement to take given what they have learned and 
what they are designed to understand. 
1. Introduction 

James Bernoulli (1713) was among the first to realize the difference between
deductive logic used in situations of certain knowledge and inductive logic, which
is necessary for the uncertain situations found in everyday problems. In Ars
Conjectandi (The Art of Conjecture), he was the first to quantify uncertainty by
identifying a set of equally possible hypotheses. This allowed him to calculate the
number of ways in which a given situation could occur relative to the total number
of possible outcomes. He also recognized that what we perceive as chance events
could be interpreted as regular predictable events if we were more knowledgeable:
“The chance depends mainly upon our knowledge.” As we will see, this statement
reflects what we consider to be a modern view of probability as extended logic.

While Bernoulli became adept at enumerating possibilities and using them to 
calculate probabilities, he was unable to use the outcomes to make inferences about 
the way in which an observed situation could have occurred. Reverend Thomas 
Bayes (1763) turned the situation around and made inferences about the causes using
the outcomes. While the resulting rule still carries his name, it was Pierre-Simon Laplace
who independently rediscovered Bayes’ Theorem (1812), presented it in its modern 
form, and went on to utilize it to solve problems in astronomy, geodesy, instru- 
mentation, error estimation, population, jurisprudence, and procedures of electoral 
bodies. Laplace’s interpretation of Bayes’ Theorem as an extension of logic led di- 
rectly to his extremely prolific application of the methodology, and can be summed 
up in a translated quote from Théorie Analytique des Probabilités: “Probability
theory is nothing but common sense reduced to calculation.” 

After Laplace, the mathematicians of the nineteenth century worked to develop
probability theory rigorously. As a general theory of inference, however, it proved
too difficult a setting in which to derive useful theorems, so the range of
applications of the theory was reduced
to relatively simple problems involving frequencies of event occurrences, which con- 
sequently led to frequentist statistics — a field which continues to confuse students 
with a bewildering array of statistical tests. It is quite amazing how the specificity 
of modern frequentist methods stands in such stark contrast to the generality of this 
theory at its conception. While others, most notably Jeffreys (1939), attempted to 
resurrect the general theory of inference, the renaissance would have to wait until 
some key insights were made in the mid-twentieth century. 

The information technology revolution in the last half of the twentieth century 
was due in great part to Claude E. Shannon’s masterpiece ‘The Mathematical 
Theory of Communication’ (1948). In it he single-handedly developed what is now 
called information theory , which has driven telecommunications, coding theory, 
signal analysis, and computer science as a whole. Key to this development was the 
concept of information-theoretic entropy . It was designed for use in communications, 
and can be thought of as a measure of the degree of uncertainty of which message, 
from a set of possible messages, will be received in a communication channel. The 
name entropy, however, has created much confusion. Myron Tribus (1971) recalls 
Shannon explaining how von Neumann suggested that he should call his measure 
‘entropy’ because the same function was already employed in statistical mechanics, 
and more importantly, that “nobody knows what entropy really is, so in a discussion 
you will always have an advantage.” Much confusion ensued due to the fact that 
Shannon’s application of information-theoretic entropy was so specific, yet it was
so similar to the poorly-understood entropy in use for over 60 years in physics. The
great insights of the next character in our story clear up these mysteries.

While Shannon demonstrated how one could use the probabilities of a set of 
messages to compute the degree of uncertainty, Edwin Thompson Jaynes computed 
probabilities based on a maximal degree of uncertainty. In this way he developed 
the Principle of Maximum Entropy (Jaynes 1957, 1979), which allows one to assign 
probabilities in the event that one possesses some knowledge in the form of con- 
straints. This allows entropy to be used as an inference tool. By maximizing the 
entropy subject to these constraints, one obtains a set of probabilities that are as 
noncommittal as possible while agreeing with what is known. Furthermore, Jaynes 
(1957) showed that this is precisely the situation encountered in statistical mechanics,
where the solutions are those which maximize the entropy subject to constraints
such as the total energy of the system. Application to thermodynamics was further 
developed and established by Myron Tribus (1961, 1971). Thus not only is the 
information-theoretic entropy related to the thermodynamic entropy, but the laws 
of statistical mechanics were demonstrated to be processes of inferential reasoning 
rather than physical laws descriptive of the system itself. 

The last key insight, on which we will expound in this paper, was made by 
Richard Threlkeld Cox, who saw that the sum and product rules of probability 
could be derived from Boolean logic (Cox 1947, 1961). This was done by gen- 
eralizing Boolean implication among logical statements to degrees of implication 
represented by real numbers. Cox’s insight was key as it provided the first rigorous 
proof of probability theory as an extension of logic. Jaynes recognized this and
became a strong proponent of probability theory as extended logic and the basis
for scientific reasoning, which he advocated in his tome, Probability Theory: The
Logic of Science (Jaynes 2003), to be published posthumously this year. This new
perspective on information-theoretic entropy and probability theory implies that 
the true information revolution (Solana-Ortega 2001) has only just begun. 

While the necessary framework laid in the middle of the twentieth century led 
to developments other than statistical physics, such as the Burg algorithm for spec- 
tral analysis (Burg 1967), application of the Bayesian methodology to more general 
scientific problems had to wait until the availability of sufficient computing power 
in the late 1970’s and early 1980’s. It was not until this time that the method- 
ology could truly prove its worth by outperforming standard techniques in many 
areas of research. Furthermore, inspired by Cox’s success in deriving the sum and 
product rules of probability from Boolean logic, much effort has gone into better 
understanding the foundations of probability theory and its relationship to another 
uncertainty-based area of physics: quantum mechanics. The following sections will
introduce a modern picture of this foundation, demonstrate its application toward 
the automation of inference and inquiry, and look forward to the possibilities this 
methodology affords. 


2. Posets and Lattices 

In this section we review the ideas behind partially ordered sets and lattices. This
modern viewpoint (Davey & Priestley 2002) will allow us to easily relate the study
of inference to the study of inquiry. The key concept required for this development
is that we can take a set of objects and an appropriate ordering relation, and
partially order the objects in the set, forming what is called a partially ordered
set or poset. We say partially order the objects, because it may be that some of
the objects in the set are incomparable, like apples and oranges. As an example,
consider the powerset of $\{a, b, c\}$, which is the set of all possible subsets, written as
$\wp(\{a, b, c\}) = \{\emptyset, \{a\}, \{b\}, \{c\}, \{a, b\}, \{a, c\}, \{b, c\}, \{a, b, c\}\}$.
We can order this set nicely with the ordering relation ‘is a subset of’, written for
example as $\{a\} \subseteq \{a, b\}$. As we mentioned, this is a partial ordering as some
elements, such as $\{a\}$ and $\{b, c\}$, are incomparable as neither one is a subset of
the other.

An important insight here is that for any given set of objects, there may be
different ordering relations that can be used, giving rise to different posets. To
generally express an ordering relation, we use the symbol $\leq$ so that $a \leq b$ is
read as ‘b includes a’. In our powerset example, $\leq$ represents $\subseteq$. In the
event that $a \leq b$ and $a \neq b$, we write $a < b$ and read ‘a is properly contained
in b’. Last, we can think of this partial ordering as imposing a hierarchy on the set
of elements. When $a < b$, but there is no element $x$ such that $a < x < b$, then we
can write $a \prec b$, which is read as ‘b covers a’. In this case b is an immediate
superior to a in the hierarchy generated by the ordering relation. Another example
is the set of numbers $\{1, 2, 3, 4\}$ ordered by the usual ‘less than or equal to’. In
this poset, 3 covers 2 as $2 < 3$, but there is no number $x$ in the set where
$2 < x < 3$.

Figure 1. Diagrams of lattices described in the text. a. Natural numbers 1, 2, 3, 4
ordered by ‘less than or equal to’. b. All subsets of $\{a, b, c\}$ ordered by ‘is a
subset of’. c. Logical statements ordered by implication.

The concept of covering allows us to illustrate the structure of the poset. First,
if $a < b$ then b is drawn higher than a in the diagram. Second, if b covers a,
$a \prec b$, then we connect a and b with a line. Figure 1 shows the diagrams for our
posets $(\{1, 2, 3, 4\}, \leq)$ and $(\wp(\{a, b, c\}), \subseteq)$. Picking any element on
the diagram of $(\wp(\{a, b, c\}), \subseteq)$, one can immediately identify which elements
contain it as a subset by following all of the lines upward from that element.
Similarly, one can find all of the elements that it contains by following the lines
downward.

In figure 1b, if you choose two elements in the diagram, say $\{a\}$ and $\{b\}$, and
follow the lines upward, the first common element that includes both $\{a\}$ and $\{b\}$
is $\{a, b\}$, which is called their join. The join of elements x and y is written
generally as $x \vee y$, thus $\{a\} \vee \{b\} = \{a, b\}$. Dually, if we choose two
elements, say $\{a, b\}$ and $\{b, c\}$, we find that the first common element that they
both include is $\{b\}$, which we call their meet. The meet of two elements x and y is
written $x \wedge y$. In the powerset example, the join of two elements can be found by
taking their set union, and their meet can be found by taking their set intersection.
However, join and meet correspond to other operations in other posets.

If the meet and join always exist, and are commutative, associative, idempotent,
and obey the absorption law, then the poset is called a lattice. Associated with each
lattice is an algebra. By focusing on the hierarchical arrangement of the elements in
the poset, one sees the structure as a lattice, whereas by focusing on the join and
meet as operations applied to its elements, one sees the structure as an algebra.
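
The powerset example can be sketched in a few lines of Python. This is only an illustration under our own naming (the helper names `powerset`, `covers`, `join` and `meet` are not from the text); it encodes each element as a `frozenset`, uses the built-in subset comparison as the ordering relation, and lists the covering pairs that would be drawn as lines in the diagram.

```python
from itertools import combinations

def powerset(items):
    """Return all subsets of `items` as frozensets."""
    items = list(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def covers(b, a, elements):
    """True if b covers a: a is a proper subset of b with nothing strictly between."""
    return a < b and not any(a < x < b for x in elements)

def join(x, y):
    """Join in the powerset lattice is set union."""
    return x | y

def meet(x, y):
    """Meet in the powerset lattice is set intersection."""
    return x & y

P = powerset("abc")
A, B, AB, BC = (frozenset(s) for s in ("a", "b", "ab", "bc"))

# {a} and {b, c} are incomparable: neither is a subset of the other.
incomparable = not (A <= BC) and not (BC <= A)

# Pairs (a, b) with b covering a; these are the lines of the diagram.
hasse_edges = [(a, b) for a in P for b in P if covers(b, a, P)]

# Absorption law: x join (x meet y) == x for every pair of elements.
absorption = all(join(x, meet(x, y)) == x for x in P for y in P)
```

Here `join(A, B)` gives the element {a, b} and `meet(AB, BC)` gives {b}, mirroring the two examples in the text.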
3. Boolean Lattices 

We now examine George Boole’s contribution to logic (1854) from the perspective
of lattices. Consider a lattice based on a set of logical statements and ordered with
the relation ‘implies’, which we write as $a \rightarrow b$. Thus in this lattice $\leq$
represents $\rightarrow$. The statements in the set will be generated from a smaller set
of exhaustive, mutually exclusive statements. By exhaustive we mean that at least one
of them is true; whereas by mutually exclusive we mean that if one of them is true,
then the others are false. This gives us a basis set of statements (generally called
atomic elements) where we are assured that one and only one is true. As an example
consider the possible atomic hypotheses representing accusations of who stole the
tarts made by the Queen of Hearts,†

a = ‘Alice stole the tarts!’
k = ‘The Knave of Hearts stole the tarts!’
n = ‘No one stole the tarts!’

We can construct new logical statements by combining two statements in two
different ways. First, the disjunction of two logical statements is a proposition that
says what the two say in common. One can think of the disjunction as being
represented by the word ‘or’. By disjoining a and k above, we obtain a new statement,
which can be shown to be the join $a \vee k$, that says ‘Either Alice or the Knave of
Hearts stole the tarts!’. Notice that if ‘Alice stole the tarts!’ is true, then it implies
that the disjunction is also true. Thus ‘Alice stole the tarts!’ is included in ‘Either
Alice or the Knave of Hearts stole the tarts!’ so that $a \rightarrow a \vee k$. The second
operation, the conjunction, is a statement that tells what the two statements tell
jointly. This is represented by the word ‘and’, and is given in the lattice by the meet
operation. Thus the conjunction of two logical statements a and b is $a \wedge b$.
Because the logical symbols for disjunction $\vee$ and conjunction $\wedge$ are identical
to, and in this lattice signify the same operations as, the join and the meet
respectively, it is important to remember that the meaning of the symbols for the
join $\vee$ and the meet $\wedge$ depends on the particular lattice.

Figure 1c shows the lattice diagram for all the possible disjunctions of the three
atomic statements. Notice that implication is directed upward, so that if any lower
element is known to be true, all the connected elements above it are also known
to be true. Working with the truth values of propositions on this lattice is called
deductive logic. This type of lattice structure is called a Boolean lattice, and its
associated algebra is a Boolean algebra. Boolean lattices have another interesting
property: for every element $x$, there exists another unique element $x'$ called its
complement, such that $x \vee x' = \top$, where $\top$ is the top element formed by the
disjunction of all the atomic elements, and $x \wedge x' = \bot$, where $\bot$ is the
bottom element formed by their conjunction. Note also that our powerset lattice has
the same structure as the lattice of logical statements with the ordering relation
$\rightarrow$. They are both Boolean lattices and both have operations which follow a
Boolean algebra.

† Chapters XI and XII in Alice’s Adventures in Wonderland, Lewis Carroll, 1865. Lewis
Carroll is of course a pseudonym for Charles Lutwidge Dodgson, also a logician who did
important work in symbolic logic.
4. Derivation of Probability from Logic 

Cox’s contribution (1947, 1961) was to generalize logical implication to degrees of
implication represented by real numbers. This allows one to talk about the degree
to which a statement a implies a statement b, written as $(a \rightarrow b)$. To ease the
transition to probability, we will write $(a \rightarrow b)$ as a function $p(b \mid a)$ of
the assertions a and b. The first goal is to figure out how to calculate the degree to
which a premise i implies a conjunction of two statements $a \wedge b$. Cox assumes that
this is a function of the degree to which i implies a and the degree to which
$a \wedge i$ implies b,

$$p(a \wedge b \mid i) = F[p(a \mid i),\, p(b \mid a \wedge i)], \tag{4.1}$$

where $F[\cdot, \cdot]$ is a function to be determined. This function will tell us how to
do the calculation. It is found by maintaining consistency with Boolean logic. If we
consider a statement formed from the conjunction of three propositions
$a \wedge b \wedge c$, we can use associativity of the lattice to write this two ways:
$(a \wedge b) \wedge c$ or $a \wedge (b \wedge c)$. Then we can use (4.1) above to rewrite
the degree $p(a \wedge b \wedge c \mid i)$ two different ways in terms of F. Consistency
requires that they are equal,

$$F[p(a \wedge b \mid i),\, p(c \mid (a \wedge b) \wedge i)] = F[p(a \mid i),\, p((b \wedge c) \mid a \wedge i)]. \tag{4.2}$$

Writing $p(a \wedge b \mid i)$ and $p((b \wedge c) \mid a \wedge i)$ above in terms of F, and
substituting $x = p(a \mid i)$, $y = p(b \mid a \wedge i)$, and
$z = p(c \mid a \wedge b \wedge i)$, we get a simple functional equation

$$F[F[x, y], z] = F[x, F[y, z]], \tag{4.3}$$

which has as its solution† $F[x, y] = xy$. So that from (4.1) we have the familiar
product rule of probability theory

$$p(a \wedge b \mid i) = p(a \mid i)\, p(b \mid i \wedge a), \tag{4.4}$$

which we see is required by consistency with associativity (Smith & Erickson 1990).

The sum rule of probability is derived similarly by noting that the degree to
which the premise i implies the complement of a statement $a'$ depends on the
degree to which i implies the original statement a: $p(a' \mid i) = G[p(a \mid i)]$.
However, the complement of the complement of a statement is the original statement
so we have the functional relation $p(a \mid i) = G[G[p(a \mid i)]]$, which leads to

$$p(a \mid i) + p(a' \mid i) = 1. \tag{4.5}$$
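
As a quick sanity check on the two functional equations, the following Python sketch confirms numerically that $F[x, y] = xy$ satisfies the associativity equation (4.3) and that $G[x] = 1 - x$ satisfies $G[G[x]] = x$ on a grid of values. It is only a check that these candidates are consistent, not a derivation of their uniqueness; the helper names and the counterexample `average` are our own.

```python
import itertools

def F(x, y):
    """Candidate solution of (4.3): the product rule."""
    return x * y

def G(x):
    """Candidate solution of G[G[x]] = x: the sum rule."""
    return 1.0 - x

def average(x, y):
    """A plausible-looking combination rule that violates (4.3)."""
    return 0.5 * (x + y)

def associativity_holds(f, values, tol=1e-12):
    """Check f(f(x, y), z) == f(x, f(y, z)) on every triple of grid values."""
    return all(abs(f(f(x, y), z) - f(x, f(y, z))) <= tol
               for x, y, z in itertools.product(values, repeat=3))

def involution_holds(g, values, tol=1e-12):
    """Check g(g(x)) == x on every grid value."""
    return all(abs(g(g(x)) - x) <= tol for x in values)

grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

product_ok = associativity_holds(F, grid)
average_ok = associativity_holds(average, grid)
sum_rule_ok = involution_holds(G, grid)
```

The averaging rule fails the check, which illustrates how the consistency requirement rules out combination rules that are not associative.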

From this point on we will recognize that probability is a real number describing 
the degree of implication on a lattice of logical statements. Thus we have developed 
probability theory as a description of one’s state of knowledge regarding logical 
propositions. As described in the introduction, this approach is much more general 
than frequentist statistics, which regards probability as representing the frequency 
of event occurrences. 

Last, we consider commutativity of the conjunction, where
$p(a \wedge b \mid i) = p(b \wedge a \mid i)$. Applying the product rule to both sides we get
$p(a \mid i)\, p(b \mid a \wedge i) = p(b \mid i)\, p(a \mid b \wedge i)$. Solving for
$p(b \mid a \wedge i)$ gives Bayes’ Theorem

$$p(b \mid a \wedge i) = p(b \mid i)\, \frac{p(a \mid b \wedge i)}{p(a \mid i)}, \tag{4.6}$$

which allows one to write $p(b \mid a \wedge i)$ in terms of $p(a \mid b \wedge i)$ thus
turning around the inference. It is important to realize that these rules are the only
logically consistent way to manipulate probabilities. Any other rules will eventually
lead to a contradiction violating logical consistency.

† There is a more general solution, but it only serves to set the scale and offset of
the logarithm of the probability.

5. Automating Inference 

The humble origin of Bayes’ Theorem belies the power that this relation wields.
First we consider a hypothesis about a situation we wish to understand. This
hypothesis could be a simple statement as in the Stolen Tarts example, or it could be a
compound hypothesis formed by taking the logical conjunction of several hypotheses.
This is useful in science when one has a parameterized model of a situation. We
can conjoin hypotheses describing the model like $h_1$ = ‘Parameter p = 2.9!’ and
$h_2$ = ‘Parameter q = 3.4!’ to form $\mathit{model} = h_1 \wedge h_2$. In such a
situation, we have a hypothesis space defined by all possible hypotheses we could
consider. In addition to our hypothesis, we may have some acquired data
d = ‘I measured r to have a value of 2.3!’. The premise i represents our knowledge
about the problem prior to obtaining new data. Rewriting Bayes’ Theorem (4.6) by
replacing a with data and b with model we get

$$p(\mathit{model} \mid \mathit{data} \wedge i) = p(\mathit{model} \mid i)\, \frac{p(\mathit{data} \mid \mathit{model} \wedge i)}{p(\mathit{data} \mid i)}. \tag{5.1}$$

The first term on the right, $p(\mathit{model} \mid i)$, called the prior probability or
prior, represents the degree to which we believe the model is correct given only our
prior information i. The term in the numerator,
$p(\mathit{data} \mid \mathit{model} \wedge i)$, is called the likelihood, which represents
the degree to which we believe that the situation described by the model could have
resulted in the observed data. The term in the denominator, $p(\mathit{data} \mid i)$, is
called the evidence and it represents the degree to which we believe the data could
have been observed based only on our prior information. Finally, the result on the
left, $p(\mathit{model} \mid \mathit{data} \wedge i)$, is called the posterior probability,
which describes how our initial state of knowledge $p(\mathit{model} \mid i)$ is updated
with the acquisition of new information. Bayes’ Theorem is thus a learning rule that
allows us to improve our state of knowledge as we gain new information!
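
The roles of the four terms can be illustrated with a minimal Python sketch over the three stolen-tarts hypotheses. The function name `bayes_update` and all of the numbers below are hypothetical, chosen only for illustration: a uniform prior, and made-up likelihoods for some imagined piece of data such as crumbs found on the Knave.

```python
def bayes_update(prior, likelihood):
    """Apply Bayes' Theorem to a discrete hypothesis space.

    prior:      dict mapping hypothesis -> p(hypothesis | i)
    likelihood: dict mapping hypothesis -> p(data | hypothesis, i)
    Returns (posterior dict, evidence p(data | i)).
    """
    # The evidence is the prior-weighted sum of the likelihoods.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
    return posterior, evidence

# Hypothetical numbers, for illustration only.
prior = {"a": 1 / 3, "k": 1 / 3, "n": 1 / 3}
likelihood = {"a": 0.1, "k": 0.7, "n": 0.2}

posterior, evidence = bayes_update(prior, likelihood)
```

Because the posterior is itself a set of probabilities over the same hypotheses, it can be passed back in as the prior when the next piece of data arrives.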

Keep in mind that these probabilities are not to be thought of as frequencies 
of event occurrences, but rather they represent the degree to which one believes a 
logical statement is true. This results in a much broader range of application. To 
utilize this methodology we must assign values to the necessary priors and likeli- 
hoods that appear on the right-hand side of the equation. The assignment of priors 
often causes concern among those who practice frequentist statistics. However, they
too assign likelihoods, which they call sampling distributions, and neglect priors, which
amounts to assuming that the priors are the same for all hypotheses. How to assign prior probabilities
in a logically consistent manner is an area of ongoing study. Jaynes (1967) gener- 
alized Bernoulli’s probability assignments by accounting for symmetries in one’s 
state of knowledge, which can be used in conjunction with Maximum Entropy to 
incorporate known constraints (Jaynes 1957, 1979). 

We will now demonstrate how this is used to develop a machine learning system
that can use data to understand a physical system. We recommend that the
interested reader consult these practical works and tutorials (Bretthorst 1988; Loredo
1990; Hanson 1993; Skilling 1998; Dose 2002), and especially Sivia (1996).

Figure 2. a. Stars uniformly distributed in space are more probable to be distant from
Earth than near. b. The origin of stellar parallax. c. The resulting posterior is narrower
than the prior: we have learned something. d. After incorporating 5 new measurements,
the posterior has grown in amplitude (compare to prior), reflecting the high probability
of the solutions, and has narrowed, reflecting a more refined solution.


(a) Stellar Distance from Parallax 

Consider a nearby star to which we would like to determine the distance d. 
Hypothesizing a value for this distance will constitute our model of this situation. 
Before we begin thinking about measurements, we know that nearby stars are 
approximately uniformly distributed in space. Thus the star has an equal chance 
to be in any given volume element of space up to a maximum resolvable distance. 
These little volume elements of space are the equal probability cases of Bernoulli. 
However, we want a probability for the distance to the star. At a given d, the star
must be in a thin shell with radius d. The volume of these shells gets bigger with
$d^2$ (figure 2a). So a prior probability $p(d \mid i) \propto d^2$ reflects our
expectation that stars are uniformly distributed in space.

As the Earth orbits the Sun, the star’s apparent position in the sky changes
with respect to the more distant stars (figure 2b). This measured angular position
change $\theta$ is called the parallax and will constitute our data. Parallax is
inversely related to distance, $\theta = 1/d$, with 1 arcsecond of angle corresponding
to 1 parsec (3.26 light-years). We can use this relation to predict a value for the
parallax given a hypothesized value for d. With some knowledge of the errors of our
measurements we can write the likelihood $p(\theta \mid d \wedge i)$ as a Gaussian
distribution centered about the parallax predicted by d. One can think of the
likelihood in terms of the forward problem where we start with our model, and compute
what it predicts we should observe. The difference between the prediction and the
observed data is represented by the likelihood. In cases where no analytic equation
like $\theta = 1/d$ exists, complex simulations must be employed to make predictions
from the model.

Using Bayes’ Theorem, we plot the posterior probability (figure 2c), which is
proportional to the product of the prior and likelihood, as a function of all the
possible values of distance. Plotted as a function of d the likelihood no longer looks
symmetric, due to the inverse relation between d and $\theta$. The important point is
that the posterior is narrower than the prior, which means that we have ruled out
possibilities and have learned something from the data. The posterior for the first
datum point can now be used as a prior for a newly acquired second piece of data
(figure 2c). Repeating this process several times results in a more certain value for
the distance to the star (figure 2d). Bayes’ Theorem thus allows us to automate an
inference procedure, taking into account prior knowledge as well as new data.

Figure 3. a. From a set of Viking Orbiter images of the Martian surface, b. the average
reveals no new information, c. while the inferred high-resolution model reveals a crater
not resolvable in an individual image.
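
The sequential update described above can be sketched on a grid of candidate distances. This is a minimal illustration under stated assumptions, not the computation behind figure 2: the distance grid (1 to 50 parsecs), the measurement values, and the error `sigma` are all hypothetical, with parallax in arcseconds and distance in parsecs so that the predicted parallax is 1/d.

```python
import math

def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_likelihood(theta_obs, d, sigma):
    """Likelihood of an observed parallax given a hypothesized distance d."""
    theta_pred = 1.0 / d  # forward model: parallax predicted by d
    return math.exp(-0.5 * ((theta_obs - theta_pred) / sigma) ** 2)

def update(prior, distances, theta_obs, sigma):
    """One application of Bayes' Theorem on the distance grid."""
    return normalise([p * gaussian_likelihood(theta_obs, d, sigma)
                      for p, d in zip(prior, distances)])

def std_dev(pdf, xs):
    """Standard deviation of a discrete distribution, used as a width measure."""
    mean = sum(p * x for p, x in zip(pdf, xs))
    return math.sqrt(sum(p * (x - mean) ** 2 for p, x in zip(pdf, xs)))

# Hypothetical set-up: grid from 1 to 50 parsecs, prior proportional to d^2.
distances = [1.0 + 0.1 * i for i in range(491)]
prior = normalise([d ** 2 for d in distances])

# Hypothetical parallax measurements (arcseconds) and measurement error.
measurements = [0.105, 0.098, 0.101, 0.096, 0.102]
sigma = 0.01

posterior = prior
widths = [std_dev(prior, distances)]
for theta in measurements:
    # The posterior from the previous datum serves as the prior for the next.
    posterior = update(posterior, distances, theta, sigma)
    widths.append(std_dev(posterior, distances))

# Most probable distance on the grid after all measurements.
best = distances[max(range(len(distances)), key=posterior.__getitem__)]
```

The list `widths` records how the spread of the distribution shrinks from the prior through successive posteriors, which is the narrowing described in the text.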

We are currently working on a more complex version of this problem where we
are estimating the distances to planetary nebulae, which are the outer atmospheres
of sun-like stars that have been thrown off during their collapse. These clouds of
gas expand in time and we can use multiple images taken over time along with the
Doppler shifts in their spectral lines due to their expansion velocity to
simultaneously model the three-dimensional structure of these objects while estimating
their distances from Earth (Knuth & Hajian 2002).


(b) Super-Resolution Imaging of Martian Surface 

In this example we discuss a project from Cheeseman et al. (1994) where multiple
images of a planetary surface are used together to infer a super-resolved image.
Figure 3a shows one image from a set of images taken by the Viking Orbiter of
a particular area on the surface of Mars. These images were taken under similar
lighting conditions; however, the camera position and orientation were slightly
different each time. To obtain a better picture of the surface, one is tempted to average
together the images (figure 3b). However, one does much better by considering a 
model image with a much higher resolution described by a set of model pixels, or 
mixels. The idea is to use the Viking images as data to infer the most probable 
mixel intensities in our model while simultaneously inferring the camera properties 
for each Viking image. Such super-resolution imaging works because each pixel in 
each image is an independent datum point describing the intensity emitted by a 
small patch of the surface. Given hypothesized mixel intensities and camera prop- 
erties, one can make predictions about the pixel intensities in the Viking images, 
which are represented by the likelihood function. The prior probability used reflects 
the expectation that neighboring mixel intensities are correlated. Using Bayes’ The- 
orem the posterior probability for any super-resolved model can be computed, and 
using efficient search algorithms, the most probable model can be found (figure 3c). 
This technique is remarkable as it reveals a newly discovered crater! 
6. Questions and the Logic of Inquiry 

While the mathematics of inference has become well-understood in the twentieth 
century, we are only beginning to understand the mathematics of inquiry. Cox 
(1979), in his last work, defined a question as the set of all possible statements that 
answer it. To be assured that this set contains every statement that answers the 
question, the set must also include all statements that imply any statement already 
in the set, as these statements also answer the question. In lattice theory, such a set 
is called a down-set as it is generated by a given set of elements in the lattice and
everything below them. Several key down-sets in the statement lattices are shown 
in figure 4. 

Given this definition, two questions are equivalent if they are answered by the
same down-set of statements. Two such questions are ‘Is it raining?’ and ‘Is it
not raining?’. They are both answered by the same down-set generated by the
statements ‘It is raining!’ and ‘It is not raining!’, and thus are equivalent as they
ask the same thing. Furthermore, we can impose an ordering relation on questions,
as the set of answers to one question may be a subset of the set of answers to
another. Consider the question T = ‘Who stole the tarts made by the Queen of
Hearts all on a summer day?’, which I will write concisely using the set (see the
down-set in lower right corner of figure 4)

$$
\begin{aligned}
T = \{\, & a = \text{‘Alice stole the tarts!’}, \\
         & k = \text{‘The Knave of Hearts stole the tarts!’}, \\
         & n = \text{‘No one stole the tarts!’} \,\}.
\end{aligned} \tag{6.1}
$$

We can also consider the binary question B = ‘Did or did not Alice steal the tarts?’,
which can be written

$$
\begin{aligned}
B = \{\, & a = \text{‘Alice stole the tarts!’}, \\
         & a' = \text{‘Alice did not steal the tarts!’} \,\}.
\end{aligned} \tag{6.2}
$$

As the defining set of T is exhaustive, the statement ‘Alice did not steal the tarts!’
is equivalent to the statement ‘Either the Knave of Hearts or no one stole the
tarts!’, written $a' = k \vee n$. As B is a down-set (see figure 4), it must contain all
the statements that imply $a'$, which are k and n. Thus the set T is a subset of B,
and by answering the question T, we will have also answered the question B. The
converse is not true, as if we obtain as an answer to B that ‘Alice did not steal the
tarts!’, then we still will have not answered T.
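
The relation between T and B can be checked with a short Python sketch. The encoding is an assumption of ours: each statement is the set of atoms it disjoins, implication is the subset relation, and the bottom element of the statement lattice is omitted for brevity. The helper names `down_set` and `answers` are likewise hypothetical.

```python
from itertools import combinations

ATOMS = ("a", "k", "n")  # Alice, the Knave of Hearts, no one

# Every non-bottom statement, encoded as the frozenset of atoms it disjoins;
# for example frozenset({"k", "n"}) stands for the statement 'k or n'.
STATEMENTS = [frozenset(c)
              for r in range(1, len(ATOMS) + 1)
              for c in combinations(ATOMS, r)]

def implies(x, y):
    """x implies y when x's atoms are a subset of y's atoms."""
    return x <= y

def down_set(generators):
    """All statements that imply at least one of the generating statements."""
    return frozenset(s for s in STATEMENTS
                     if any(implies(s, g) for g in generators))

def answers(x, y):
    """Question x answers question y when x's down-set is contained in y's."""
    return x <= y

a, k, n = (frozenset({x}) for x in ATOMS)
not_a = k | n  # 'Alice did not steal the tarts!' is equivalent to 'k or n'

T = down_set([a, k, n])    # 'Who stole the tarts?'
B = down_set([a, not_a])   # 'Did or did not Alice steal the tarts?'
```

Under this encoding `answers(T, B)` holds and `answers(B, T)` does not, matching the argument in the text.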

At this stage, we are well on our way to constructing the lattice of all questions
that can be asked relative to the issue ‘Who stole the tarts?’. With the ordering
relation ‘is a subset of’, or equivalently ‘answers’, we can show that the conjunction
or meet of two questions is the intersection of the down-sets of statements answering
each question, so that $X \wedge Y = X \cap Y$. This results in a question which asks what
the two ask jointly, thus earning it the name of the joint question. Similarly the
disjunction or join of two questions, called the common question, is formed from the
union of the two down-sets of statements answering each question,
$X \vee Y = X \cup Y$, and as such asks what the two questions ask in common.

Figure 4. The lattice of all questions (center) generated by three mutually exclusive
statements. Examples of the down-sets defining several questions are shown, including
T and B discussed in the text (right). The ordering relation ‘is a subset of’ is applied
to the down-sets of statements, so that lower questions answer higher questions.

The lattice of questions can be formed by considering all the possible down-sets
of the assertion lattice and ordering them appropriately. As writing out a question
in terms of its assertions can be quite lengthy, we will use the following notation,
where A represents the down-set formed by descending the assertion lattice from
the element a (figure 4), AK represents the down-set formed from the element a ∨ k
(figure 4), and AKN represents the down-set formed from the element a ∨ k ∨ n.
The elements of the question lattice are then formed from all possible disjunctions
of the questions A, K, N, AK, AN, KN, AKN. As an example, the binary question
B in the example above can be written as B = A ∪ KN, denoting that its possible
answers derive from a and k ∨ n.
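
The construction of the question lattice from the generators A, K, N, AK, AN,
KN and AKN can be sketched in a few lines. This is an illustrative enumeration
under the assumptions that statements are encoded as non-empty sets of atoms and
that the bottom element is omitted; the names GENERATORS and all_questions are
hypothetical. Under this encoding the enumeration yields 18 distinct questions,
the size of the free distributive lattice on three generators.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of `items`, each as a frozenset."""
    items = sorted(items)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def down_set(statement):
    """The statements implying `statement` (bottom element omitted)."""
    return frozenset(nonempty_subsets(statement))


# The seven statements a, k, n, a v k, a v n, k v n, a v k v n.
STATEMENTS = nonempty_subsets("akn")

# Principal down-sets, keyed by the names used in the text: A, K, N, AK, ...
GENERATORS = {"".join(sorted(s)).upper(): down_set(s) for s in STATEMENTS}


def all_questions(generators):
    """Every distinct union of one or more generator down-sets."""
    downs = list(generators.values())
    found = set()
    for size in range(1, len(downs) + 1):
        for combo in combinations(downs, size):
            found.add(frozenset().union(*combo))
    return found


QUESTIONS = all_questions(GENERATORS)
B = GENERATORS["A"] | GENERATORS["KN"]  # the binary question of the text

print("number of questions:", len(QUESTIONS))
print("B is an element of the lattice:", B in QUESTIONS)
```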

Figure 4 (center) shows the question lattice for the three mutually exclusive 
statements in our stolen tarts example (Knuth 2002). This lattice is not Boolean, 
which indicates that questions do not possess complements. This particular lattice 
structure is known as the free distributive lattice, and as it is associative and
distributive, it possesses a measure analogous to probability, which, following its
own sum and product rules, describes the degree to which one question answers
another. For this reason it is called bearing or relevance. With this measure, one can
compute the relevance that a question Q has on an outstanding issue I, denoted b(Q|I).
The notation, introduced by Robert Fry, represents an upside-down p reflecting the 
inherent relationship between relevance on the question lattice and probability on 
the statement lattice. Thus we are introduced to the calculus of inquiry. 

While the exact relationship between probabilities on the statement lattice and
the relevances on the question lattice is still being explored, the results we have
obtained to date (Cox 1979; Fry 1999; Knuth 2002) suggest that relevance can be 
represented in terms of the entropy of the probabilities, and that the calculus of 
inquiry is a generalization of information theory. This has intuitive appeal as the 
probabilities then represent what is known, while entropies or relevances represent 
what is not known. It also makes sense from an information-theoretic standpoint
where the design of a communication channel can be interpreted as the design of a 
question to be asked of the transmitter. This is also suggested by the relationships 
between the lattices (Knuth 2002) as the map from the statements to the questions 
acts like an exponential function, whereas the inverse map acts like the logarithm 
(Davey & Priestley 2002).
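
One way to make the suggested link between relevance and entropy concrete is
sketched below. It is only an illustrative reading, not an established result of
this work: it assumes that a question whose coarsest answers partition the
possibilities can be scored by the Shannon entropy of the probabilities of those
answers. The probabilities assigned to a, k and n are invented for the example.

```python
import math

# Invented probabilities for the three mutually exclusive statements.
P = {"a": 0.5, "k": 0.3, "n": 0.2}


def prob(statement):
    """Probability of a disjunction of mutually exclusive atoms, e.g. "kn"."""
    return sum(P[atom] for atom in statement)


def entropy(coarsest_answers):
    """Shannon entropy (in bits) over the coarsest answers of a question."""
    total = 0.0
    for s in coarsest_answers:
        p = prob(s)
        if p > 0:
            total -= p * math.log2(p)
    return total


H_T = entropy(["a", "k", "n"])  # 'Who stole the tarts?'
H_B = entropy(["a", "kn"])      # 'Did or did not Alice steal the tarts?'

print("entropy of T (bits):", round(H_T, 4))
print("entropy of B (bits):", round(H_B, 4))
```

With these numbers the entropy of B does not exceed that of T, which is at
least consistent with T answering B; the sketch says nothing about questions
that are not partitions.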

When probabilities of what is known can be used to compute relevances among 
that which we desire to know, we will be able to design machines that can identify 
missing information and request it. To do this, one will not need to draw all these 
elaborate lattice diagrams. All that will be necessary is the calculus of probability 
and relevance. This will be especially important in scientific problems where one 
is unable to ask a single question that will resolve an issue directly, and is instead 
limited to a given set of experimental questions. 
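
To indicate how a machine might choose among a limited set of experimental
questions, the sketch below ranks experiments by expected information gain, the
expected reduction in the entropy of the hypotheses. This is the standard
Bayesian experimental design criterion, used here as a stand-in and not as the
relevance calculus itself. The prior, the two experiments and their likelihood
tables are all invented for the illustration.

```python
import math

# Invented prior over the three hypotheses of the stolen tarts example.
PRIOR = {"a": 0.5, "k": 0.3, "n": 0.2}

# Hypothetical experiments, each given as P(outcome | hypothesis).
EXPERIMENTS = {
    "question_alice": {
        "a": {"confesses": 0.6, "denies": 0.4},
        "k": {"confesses": 0.0, "denies": 1.0},
        "n": {"confesses": 0.0, "denies": 1.0},
    },
    "count_the_tarts": {
        "a": {"missing": 1.0, "present": 0.0},
        "k": {"missing": 1.0, "present": 0.0},
        "n": {"missing": 0.0, "present": 1.0},
    },
}


def entropy(dist):
    """Shannon entropy (in bits) of a dictionary of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def expected_information_gain(prior, likelihood):
    """Prior entropy minus the expected posterior entropy over outcomes."""
    outcomes = next(iter(likelihood.values())).keys()
    gain = entropy(prior)
    for outcome in outcomes:
        p_outcome = sum(prior[h] * likelihood[h][outcome] for h in prior)
        if p_outcome == 0:
            continue
        posterior = {
            h: prior[h] * likelihood[h][outcome] / p_outcome for h in prior
        }
        gain -= p_outcome * entropy(posterior)
    return gain


GAINS = {
    name: expected_information_gain(PRIOR, table)
    for name, table in EXPERIMENTS.items()
}
BEST = max(GAINS, key=GAINS.get)

for name, gain in GAINS.items():
    print(name, round(gain, 4))
print("most informative experiment:", BEST)
```

A cost for each experiment could be incorporated by ranking on gain per unit
cost instead, though that choice of trade-off is also an assumption.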

7. Generalization of the Methodology 

There are two very important realizations to be made. First, the concept of gener- 
alizing inclusion on a lattice to a degree of inclusion can be made on any kind of 
lattice. Thus we expect that there are other rules analogous to the sum and product
rules we introduced here that exist in other disciplines. Ariel Caticha (1998)
has shown how the sum and product rules can be derived from associativity and 
distributivity, respectively, thus indicating that any lattice that has the distributive 
property has associated with it a degree of inclusion that follows a sum and prod- 
uct rule (Knuth 2003, unpublished work). In addition, the cross-ratio in projective 
geometry has been shown to have the same form as the odds-ratio in Bayesian infer- 
ence (Rodriguez 1991), which is now believed to derive from the fact that the pro- 
jective lattice also exhibits associativity (Knuth 2002). Fry has also demonstrated 
that this methodology is applicable to the area of control in cybernetic systems (Fry 
2002). Second, degrees of inclusion do not need to be represented by real numbers. 
Complex numbers and quaternions also conform to the consistency requirements 
(Youssef 1994; Youssef 2001, unpublished work), as do the more general Clifford al- 
gebras (Rodriguez 1991), which are multivectors in the geometric algebra (Hestenes 
& Sobczyk 1984) described in an earlier Millennium Issue (Lasenby et al. 2000).
Furthermore, Caticha (1998) has derived the calculus of wavefunction amplitudes 
and the Schrödinger equation entirely by constructing a poset of experimental setups
and using Cox’s arguments of consistency with degrees of inclusion represented
with complex numbers. This work leads to a very satisfying description of quantum 
mechanics in terms of measurements while explaining how it looks like probability 
theory — yet is not. We expect that the generalizations of lattice theory described 
here will not only identify unrecognized relationships among disparate fields, but 
also allow new measures to be developed and understood at a very fundamental 
level. 


8. Automating Both Inference and Inquiry 

Automation of both inference and inquiry will allow machines to learn from data, 
as well as ask relevant questions to obtain new data that could aid their inferences. 
This promises to automate the scientific method within a framework defined by a 
set of possible experiments (questions that can be asked of a system) and a set of
hypothesized theoretical models. Imagine a robot that has drilled through several 
kilometers of Europa’s icy crust to emerge into an immense ocean far from any 
possible intervention by its human creators on Earth. Having been programmed to 
resolve the issue ‘Is there life in Europa’s ocean?’, the machine, taking into account 
all of the relevant data on Europa that its creators have gathered, calculates the 
most relevant experimental question to ask given what it knows. Each experimental 
apparatus it carries also comes with its own energy cost, which the machine may 
also take into account. What is learned in its earlier experiments will help it decide 
which successive experiments to perform to resolve the scientific issue. 

While the creation of independently behaving, learning machines will undoubtedly
find great use in areas of science where humans are unable to perform experiments,
they will most likely pervade our lives in ways we have not yet imagined,
most probably with as much potential to be annoying as to be helpful. While
the methodology necessary to construct such thinking machines is becoming clear, 
we have yet to find a way to automatically generate new hypotheses for the machines 
to entertain. Such flashes of inspiration serving to change the way we perceive the 
world are not obviously related to any logical procedure, and often occur through 
generalizations and analogies, which at this point seem to fall outside the domain 
of the methodology discussed in this paper. 

This work was supported by the Intelligent Data Understanding Project/Intelligent Sys- 
tems Program and NASA Aerospace Technology Enterprise. 